Improving sample efficiency in deep reinforcement learning based control of dynamic systems

Banerjee, Chayan

Title: Improving sample efficiency in deep reinforcement learning based control of dynamic systems
Creator: Banerjee, Chayan
Relation: University of Newcastle Research Higher Degree Thesis
Resource Type: thesis
Date: 2023
Description: Research Doctorate - Doctor of Philosophy (PhD)
Description: Actor-critic (AC) algorithms are a class of model-free deep reinforcement learning (DRL) algorithms that have proven effective in diverse domains. Being model-free, they require many agent-environment interactions (samples) to learn a policy. AC algorithms therefore suffer from low sample efficiency, which makes them unsuitable for many real-world applications where samples are costly or hazardous to obtain. Resolving this issue has long been a topic of active research. Approaches to mitigating sample inefficiency can typically be classified as on-policy, off-policy, or exploration-boosting. Despite their effectiveness, these approaches have several limitations and constraints. Our research centers on mitigating low sample efficiency. We contribute to each of these classes of traditional mitigation approaches and design new, more efficient algorithms. The thesis presents three works, each of which addresses certain limitations of one of the three classes. First, we improve the sample efficiency of an on-policy algorithm by optimizing the training dataset meant for the optimal policy network. The optimization comprises a best-episode-only operation, a policy parameter-fitness model, and a genetic algorithm module. Next, we introduce crucial modifications to boost the performance of an off-policy AC algorithm. The resulting algorithm features a novel prioritization scheme for selecting better samples from the experience replay buffer. It also trains the policy and value function networks on a mixture of the prioritized off-policy data and the latest on-policy data. Finally, for the exploration-boosting approach, we propose a new algorithm that boosts exploration through an intrinsic reward based on a measure of a state's novelty and the associated benefit of exploring that state (with regard to policy optimization), together called plausible novelty. The algorithm can be paired with any off-policy AC algorithm to improve sample efficiency. All algorithms were extensively evaluated on benchmark environments from the OpenAI Gym platform. All the proposed algorithms performed substantially better than their conventional counterparts and improved sample efficiency.
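Illustrative note: the abstract mentions training on a mixture of prioritized off-policy data and the latest on-policy data. The sketch below shows one generic way such a mixed batch could be assembled. It is not the thesis's method: the function name, the `on_policy_fraction` parameter and its default, and the use of plain proportional prioritization are all assumptions made for illustration, since the record does not describe the actual prioritization scheme.

```python
import random


def sample_mixed_batch(replay_buffer, priorities, recent_on_policy,
                       batch_size, on_policy_fraction=0.25, rng=None):
    """Build a batch from prioritized replay samples plus recent on-policy ones.

    Hypothetical helper: every name and the 0.25 default are assumptions,
    not values taken from the thesis.
    """
    rng = rng or random.Random()
    # Reserve part of the batch for the newest on-policy transitions.
    n_on = min(int(batch_size * on_policy_fraction), len(recent_on_policy))
    n_off = batch_size - n_on
    # Proportional prioritization (an assumed stand-in scheme): each stored
    # transition is drawn, with replacement, with probability proportional
    # to its priority.
    off_policy = rng.choices(replay_buffer, weights=priorities, k=n_off)
    # Take the most recent on-policy transitions; guard n_on == 0 because
    # a slice of [-0:] would return the whole list.
    on_policy = list(recent_on_policy[-n_on:]) if n_on > 0 else []
    return off_policy + on_policy
```

In this sketch the returned list holds the off-policy draws first and the on-policy transitions last; a real training loop would typically shuffle the batch and convert it to tensors before the policy and value updates.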
Subject: reinforcement learning; off-policy learning; on-policy learning; soft actor-critic; exploration boosting; policy optimization; state novelty; intrinsic reward
Identifier: http://hdl.handle.net/1959.13/1477632
Identifier: uon:50012
Rights: In reference to IEEE copyrighted material which is used with permission in this thesis, the IEEE does not endorse any of The University of Newcastle's products or services. Internal or personal use of this material is permitted. If interested in reprinting/republishing IEEE copyrighted material for advertising or promotional purposes or for creating new collective works for resale or redistribution, please go to http://www.ieee.org/publications_standards/publications/rights/rights_link.html to learn how to obtain a License from RightsLink. If applicable, University Microfilms and/or ProQuest Library, or the Archives of Canada may supply single copies of the dissertation. Copyright 2023 Chayan Banerjee
Language: eng
Full Text

File	Description	Size	Format
ATTACHMENT01	Thesis	12 MB	Adobe Acrobat PDF
ATTACHMENT02	Abstract	333 KB	Adobe Acrobat PDF